ABSTRACT

Online Social Networks (OSNs) are used by millions of users 
worldwide. Academically speaking, there is little doubt about 
the usefulness of demographic studies conducted on OSNs 
and, hence, methods to label unknown users from small labeled
samples are very useful. However, from the general public's
point of view, this can be a serious privacy concern.
Thus, both topics are tackled in this paper: First, a new 
algorithm to perform user profiling in social networks is de- 
scribed, and its performance is reported and discussed. Sec- 
ondly, the experiments -conducted on information usually 
considered sensitive- reveal that by just publicizing one's 
contacts privacy is at risk and, thus, measures to minimize 
privacy leaks due to social graph data mining are outlined. 

Categories and Subject Descriptors 

G.2.2 [Discrete Mathematics]: Graph Theory - Graph algorithms,
Graph labeling; I.5.2 [Pattern Recognition]: Design
Methodology - Classifier design and evaluation; K.4.1
[Computers and Society]: Public Policy Issues - Privacy

General Terms 

Algorithms, Experimentation, Human Factors, Legal As- 
pects 

Keywords 

Online Social Networks, Twitter, graph labeling, privacy 

1. INTRODUCTION AND MOTIVATION 

Graph Labeling is the task of assigning labels to the ver- 
tices or edges of a graph. Because social networks are usually 
represented as graphs, vertex and edge labeling algorithms 
can be applied to them straightforwardly. In the former case 
the individuals in the network are labeled, while in the latter
the labels are assigned to the relationships between them. 
In that context, labeling algorithms can exploit a property
of social networks: homophily, the tendency of people to
relate to those who share similar traits. This phenomenon is
pervasive across very different social networks, and a number
of personal characteristics -such as race and ethnicity, age,
religion, education, occupation, or sex- have been shown to
induce homophilous relationships [13].

Thereby, homophily can be used either to cluster similar
individuals within a network or to infer attribute values for
every individual from his neighbors' characteristics. In the
first case -community detection- it is not necessary to know 
anything about the members of the social network except 
for their relations. In the second, the attributes of interest 
are needed and, hence, part of the individuals in the network 
must have known values for them -i.e. they must be labeled. 
Thus, from a machine learning perspective, the first is an 
unsupervised problem while the second is semi-supervised. 

It is well-known that online social networks (OSNs) ask 
their users for personal information and that many of those 
users happily provide it. Hence, part of the users in OSNs 
are labeled and semi-supervised approaches can be employed 
to label the rest of the members of the network. 

This paper describes a new semi-supervised algorithm to 
perform user profiling in social networks. A number of ex- 
periments conducted on Twitter data are reported, and the 
privacy implications discussed. Certainly, there exist a number
of semi-supervised methods to label partially labeled social
graphs. Hence, a later section reviews those most closely
related to the method proposed here, and points out the
main differences between them. Moreover, previous works
regarding the privacy implications of social graph
mining are also discussed. In that sense, this paper focuses
on active measures the users can adopt and, thus, it outlines 
a protocol to minimize leakages due to graph data mining. 

2. THE McC-SPLAT ALGORITHM 

McC-Splat is an iterative algorithm to perform vertex
labeling on a partially labeled social network. It is a mul- 
ticlass classifier, that is, each attribute can have more than 
two classes and, in fact, in addition to the predefined at- 
tribute values -e.g. female and male for sex- an extra class, 
unknown, is also required for each attribute. 

Needless to say, individuals have a number of different
attributes -e.g. sex, age, or marital status- and, hence, each
person would have a profile comprising such attributes
with their corresponding values. The McC-Splat algorithm
can simultaneously propagate the values for each attribute
in the users' profiles but, for the sake of clarity, the following
description just covers the single attribute case.

Footnote: McC-Splat is a mnemonic for Multiclass Classification
using Soft Labeling Propagation and Automatic Thresholding.

McC-Splat works on a directed graph $G = \{V, E, C, A\}$
where $V$ is the set of vertices -i.e. individuals-, $E$ denotes
the edges -i.e. relationships between those individuals-,
$C = \{c_1, \ldots, c_m\}$ are the different classes each
individual's attribute can take, and finally $A$ is the set of
attribute weight vectors for the individuals in $V$.

By using attribute weight vectors it would be possible to 
model overlapping classes; nevertheless, all of the experi- 
ments reported here were conducted with disjoint classes. 
Moreover, the weights can be seen as a proxy for the user's 
likelihood to belong to a given class -including the unknown 
one- although weights are not probabilities in a strict sense. 
The following formalization describes the A set: 

$$A \subset [0, 1]^{m+1}, \quad m = |C|, \quad \forall a \in A : |a| = 1$$

Given that G is partially labeled, V is divided into two
disjoint sets: the set $V^K$ of known vertices -i.e. those
individuals with a known class value for the attribute- and the
set $V^U$ of unknown vertices -i.e. those belonging to the
unknown class. Because the attribute values can be taken from m
different classes this can be formalized as:

$$V^K = \{v_i \in V \mid \exists j : 1 \le j \le m,\ a_{ij} = 1\}$$

$$V^U = \{v_i \in V \mid \neg\exists j : 1 \le j \le m,\ a_{ij} = 1\}$$

Finally, a definition for the neighborhood of each vertex 
is needed. In this regard it must be noted that (1) this 
algorithm assumes directed graphs, and (2) it only considers 
as neighbors of a person those people related to the first one 
via a relationship started by that person. 

For instance, in a phone network those numbers a user 
makes calls to would be neighbors, but not the numbers from 
which he receives calls; in the blogosphere the neighborhood 
would comprise the blogs a given blog links to, but not those 
linking to that blog; in Twitter the neighbors would be those 
users a given user is following, but not his followers. 

Thereby, the neighborhood for vertex $v_i \in V$, where
$e_{ij}$ denotes an edge from $v_i$ to $v_j$, would be:

$$N(v_i) = \{v_j \in V \mid \exists e_{ij} \in E\}$$
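The neighborhood definition can be illustrated with a minimal Python sketch. It assumes the graph is given as a list of (source, target) pairs; the function name and the toy "follows" edges are hypothetical and not part of the original work.

```python
from collections import defaultdict


def out_neighborhoods(edges):
    """Map each vertex to the set of vertices it points to.

    Only relationships started by a vertex count: in Twitter terms,
    the users it follows, never its followers.
    """
    neighbors = defaultdict(set)
    for source, target in edges:
        neighbors[source].add(target)
    return dict(neighbors)


# Hypothetical toy graph: (follower, followee) pairs.
edges = [("ann", "bob"), ("ann", "cat"), ("bob", "ann"), ("dan", "ann")]
hoods = out_neighborhoods(edges)
```

In this toy graph "ann" has two neighbors, while "cat" has none because she follows nobody, regardless of who follows her.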

All of this defines the input graph but not the way in
which the algorithm works on it. As has been said, it
is an iterative algorithm and, hence, at its core there is an
operation to compute new weights for each vertex attribute
vector from its neighbors' weights in the previous iteration.
It must be noted that only the weights for vertices belonging
to the unknown set $V^U$ are updated; those from the originally
labeled set $V^K$ are not assigned new weights:

$$\forall v_i \in V^U : a_i^{(t+1)} = \frac{1}{Z} \sum_{v_j \in N(v_i)} a_j^{(t)}$$

$$\forall v_i \in V^K : a_i^{(t)} = a_i^{(0)}$$

In the previous formalization Z is a normalizer and $N(v_i)$
is the neighborhood of $v_i$.
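The update can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the normalizer Z is taken to be the sum of the accumulated components (so that each vector keeps adding up to 1), updates are synchronous, and vertices without neighbors keep their current vector. The function name and the toy data are hypothetical.

```python
def propagate(neighbors, labels, classes, iterations=10):
    """Propagate weight vectors from labeled to unlabeled vertices.

    neighbors: dict vertex -> iterable of vertices it points to.
    labels:    dict vertex -> class name, for the known vertices only.
    Returns a dict vertex -> list of len(classes) + 1 weights; the
    last component corresponds to the extra 'unknown' class.
    """
    size = len(classes) + 1

    def one_hot(k):
        return [1.0 if i == k else 0.0 for i in range(size)]

    vertices = set(neighbors) | set(labels)
    for targets in neighbors.values():
        vertices |= set(targets)

    weights = {
        v: one_hot(classes.index(labels[v])) if v in labels
        else one_hot(size - 1)
        for v in vertices
    }
    for _ in range(iterations):
        updated = dict(weights)
        for v in vertices:
            targets = neighbors.get(v, ())
            if v in labels or not targets:
                continue  # known vertices are never modified
            total = [sum(weights[n][k] for n in targets)
                     for k in range(size)]
            z = sum(total)  # assumed normalizer
            updated[v] = [t / z for t in total]
        weights = updated
    return weights


# Hypothetical toy data: u follows three labeled users, w follows u.
neighbors = {"u": ["a", "b", "c"], "w": ["u"]}
labels = {"a": "female", "b": "female", "c": "male"}
weights = propagate(neighbors, labels, ["female", "male"])
```

Here u ends up with weights (2/3, 1/3, 0), and w inherits the same vector from u one iteration later.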

McC-Splat, like other iterative graph algorithms, converges
after relatively few iterations. Hence, once weight vectors
have stabilized -or after a predefined number of iterations-
a large part of the vertices in the unknown set $V^U$ have
weight vectors for the attribute of interest. Other algorithms
would then assign to each vertex the label with the highest
weight within the vector, or would require an ad hoc threshold
to be defined for each class value. McC-Splat, instead,
introduces two extra steps which can be used to achieve
automatic thresholding in a number of ways.

First of all, a fictitious sink vertex can be introduced. Such 
a vertex would represent an individual related to every sin- 
gle person within the social network. The weights for that 
vertex are computed after the last iteration and they pro- 
vide a measure of which weights could be expected for a 
user without homophilous relationships. The usefulness of 
such an approach is clear when a large majority of people 
belongs to a single class; if that prevalence is not taken into 
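account most of the unknown individuals would be incorrectly
assigned to the majority class. This equation defines
the weight vector for such a sink vertex, where T denotes the
last iteration and Z is a normalizer:

$$a_{sink} = \frac{1}{Z} \sum_{v_j \in V} a_j^{(T)}$$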

Once the sink vertex weights are computed they can be
used in two ways: (1) vertices from the unknown set $V^U$ can
be assigned the label with the highest weight which is also
above the corresponding weight in the sink vector; or (2)
vertices from $V^U$ can be assigned the label whose weight
departs the most -in percentage value- from the corresponding
weight in the sink vector.
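Both uses of the sink vector can be sketched as follows. The sketch assumes that the sink weights are the normalized sum of every vertex's weight vector and that the last component of each vector is the unknown class, which is ignored when choosing a label. Function names and figures are hypothetical.

```python
def sink_vector(weights):
    """Average all weight vectors: the sink is linked to every vertex."""
    vectors = list(weights.values())
    size = len(vectors[0])
    return [sum(v[k] for v in vectors) / len(vectors) for k in range(size)]


def sink_absolute(vector, sink, classes):
    """Highest weight among the classes whose weight exceeds the sink's."""
    above = [(vector[k], name) for k, name in enumerate(classes)
             if vector[k] > sink[k]]
    return max(above)[1] if above else "unknown"


def sink_relative(vector, sink, classes):
    """Class with the largest positive relative gain over the sink."""
    best, best_gain = "unknown", 0.0
    for k, name in enumerate(classes):
        if sink[k] > 0:
            gain = (vector[k] - sink[k]) / sink[k]
            if gain > best_gain:
                best, best_gain = name, gain
    return best


classes = ["christian", "atheist"]  # last vector component is 'unknown'
sink = [0.80, 0.05, 0.15]           # hypothetical sink weights
user = [0.70, 0.20, 0.10]           # hypothetical propagated weights
plain = classes[max(range(len(classes)), key=lambda k: user[k])]
absolute = sink_absolute(user, sink, classes)
relative = sink_relative(user, sink, classes)
```

With a heavily prevalent class, picking the highest weight labels this user as christian, whereas both sink-based rules pick atheist because 0.20 is far above what a user without homophilous relationships would be expected to show.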

The second approach to automatic thresholding requires
computing an alternative weight vector for the members of
the known set $V^K$. As has been said, those vertices' vectors
have one single component with a unity value -i.e. the
component corresponding to the class each individual belongs to-
and their vectors are not modified as the algorithm iterates.
However, it is possible to compute from their corresponding
neighborhoods the weights they would have if they had
belonged to the unknown set $V^U$ (with $N(v_i)$ denoting the
neighborhood of $v_i$):

$$\forall v_i \in V^K : a_i'^{(t+1)} = \frac{1}{Z} \sum_{v_j \in N(v_i)} a_j^{(t)}$$

By doing that it is possible to produce a reverse-ordered 
ranking of individuals for each of the class values the at- 
tribute can take. That way, instead of defining an ad hoc 
threshold to decide if a weight is high enough to accept the 
induced label, it is possible to find different weights at dif- 
ferent percentile values. 

Thereby, when using McC-Splat there is no need to set ad
hoc weight thresholds; instead, the confidence required from
the labeled output can be chosen. For instance, by choosing
the 90th percentile only those members of the unknown set
$V^U$ whose weights were above 90% of the weights of the
known set $V^K$ members would appear in the output labeled set.
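A possible reading of the percentile rule is sketched below. It assumes that, for each class, the reference distribution is the list of alternative weights computed for the labeled users; exactly which labeled users form that reference is an assumption, as are the function names and the figures.

```python
from bisect import bisect_left


def percentile_rank(value, reference):
    """Fraction of the reference weights strictly below the value."""
    ordered = sorted(reference)
    return bisect_left(ordered, value) / len(ordered)


def percentile_label(vector, reference_by_class, classes, minimum=0.9):
    """Assign the class with the highest percentile, if high enough."""
    ranks = {
        name: percentile_rank(vector[k], reference_by_class[name])
        for k, name in enumerate(classes)
    }
    best = max(ranks, key=ranks.get)
    return best if ranks[best] >= minimum else "unknown"


classes = ["democrat", "republican"]
# Hypothetical alternative weights of the labeled users, per class.
reference = {
    "democrat": [0.05, 0.10, 0.15, 0.20, 0.25,
                 0.30, 0.35, 0.40, 0.45, 0.50],
    "republican": [0.50, 0.55, 0.60, 0.65, 0.70,
                   0.75, 0.80, 0.85, 0.90, 0.95],
}
confident = percentile_label([0.48, 0.52, 0.00], reference, classes)
doubtful = percentile_label([0.30, 0.60, 0.10], reference, classes)
```

The first user has a slightly higher republican weight, yet 0.48 is an unusually high democrat weight (above 90% of the reference) while 0.52 is a low republican one, so the rule outputs democrat. The second user does not reach the 90th percentile for any class and stays unknown.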

So, in short, McC-Splat comes in the following flavors:
(1) Plain-vanilla, the class with the highest weight is
assigned. (2) Sink-absolute, the class with the highest weight
and above the corresponding weight within the sink node is
assigned. (3) Sink-relative, the class with the highest positive
difference against the corresponding weight within the sink
node is assigned. (4) Percentile, the class with the highest
percentile -according to the labeled individuals- is assigned.
Optionally, a minimum value -e.g. 90%- can be forced or,
otherwise, the unknown class is assigned.

Footnote: In the Plain-vanilla flavor the unknown class is
ignored; otherwise all of the vertices would remain unknown
unless the number of labeled examples surpassed the number of
unknown vertices.

Now, the algorithm's name should be self-explanatory: it
is a multiclass classifier which iteratively propagates weight
vectors to every node from its neighborhood; because each
node's vector roughly represents its likelihood of belonging to
each class, the labeling is performed in a "soft" rather than in
a "hard" way; moreover, the algorithm provides alternatives
to automatically determine the most reliable class for each
node, making ad hoc thresholds unnecessary.

3. EXPERIMENTAL EVALUATION 

3.1 Dataset Description 

Social network data was needed to test the performance 
of McC-Splat. The graph depicting relationships between 
individuals was essential but, in addition to that, a part of 
those users had to be labeled. 

Hence, the Twitter dataset collected in [2] was used. It
comprises 27.9 million tweets written in English and published
from January 26 to August 31, 2009 by 4.98 million users.

Followers and followees for each of the users in that dataset
were also collected. Links to users not appearing in the
dataset were disregarded, and isolated users were removed.
Furthermore, a substantial number of user accounts were
suspended at the moment of the graph crawl and, hence, no
information on them was available. Lastly, because of
unavoidable network problems, coupled with the fact that
the API was pushed a little too far, the information for a
noticeable number of users was eventually not crawled.

Thus, the user graph consisted of 1.8 million users with
their corresponding links and profiles -i.e. full name, short
biography, location, etc. Given that at the moment of
collecting the dataset the number of Twitter users in the U.S.
was estimated between 14 and 18 million, and that most
of the crawled users were supposed to be from the U.S., it
can be considered a rather substantial sample.

3.2 Labeling Twitter users 

Unlike other OSNs such as Facebook, Twitter profiles do 
not provide highly structured information; there is no way, 
for instance, to indicate the user's sex or age. Instead, Twit- 
ter profiles consist of the user's full name, location, website, 
and a short biography. All of these fields are free text and 
there is a high disparity in their use. For example, 62.31% of
the users in the dataset provide a location string, but only
36.46% provide their full personal name [2].

This does not mean that no personal information can be
extracted from Twitter profiles. Quite to the contrary, using
the location, full name, and biography strings, half of the
users in the dataset were geolocated, and the sex of one third
of them was found, in addition to the age of about 11,000
[2]. Needless to say, the data was noisy and the labeling
methods a bit rough; however, anecdotal evidence revealed a
quite accurate big picture of Twitter demographics.

Footnote: Twitter is a microblogging and social networking
service. Users publish short text messages (tweets) which are
shown to all of their followers. Relationships in Twitter are
asymmetrical and, thus, a user has followers and followees.

Footnote: Sources for the estimated number of Twitter users in
the U.S.: http://www.socialtimes.com/2009/04/twitter-14-million/,
http://mashable.com/2009/09/14/twitter-2009-stats/

Therefore, a similar approach was employed to label users
according to a number of personal traits. In addition to sex
and age, the following attributes were also chosen: political
orientation, religious affiliation, race and ethnicity, and
sexual orientation. All of them are usually considered sensitive
information, and most countries have enacted laws against
discrimination based on any of these attributes. In spite of
this, many people still feel the need to hide those personal
details. Thus, it is important to find out the degree to which
such individuals can be inadvertently exposed because of
their acquaintances.

All of the classes, except those corresponding to sex, were
determined by means of pattern matching (see table 1 for
the patterns applied). Firstly, each class name was used to
obtain an initial list of users. For instance, the patterns
democrat* and republican* were used to find users self-defined
as Democrats or Republicans. Once there was a preliminary
list of users for each class, their biographies were mined to
find the most frequent keywords which could be considered
indicative of class membership. That way, for example,
patterns such as lib-dem* or dems* were found for Democrats,
and conservat* or tea party for Republicans.
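The pattern-matching step might be sketched as follows. The sketch assumes that a trailing * stands for any run of word characters, that matching is case-insensitive, and that a biography matching patterns from more than one class is left unlabeled; the source does not state how such conflicts were resolved. Only a subset of the patterns is shown, and all names are hypothetical.

```python
import re

# Subset of the political orientation patterns, for illustration only.
PATTERNS = {
    "democrat": ["democrat*", "lib-dem*", "libdem*", "dems*"],
    "republican": ["conservat*", "gop", "republican*", "tea-party",
                   "teaparty"],
}


def compile_pattern(pattern):
    """Turn a wildcard pattern into a whole-word regular expression."""
    body = re.escape(pattern).replace(r"\*", r"\w*")
    return re.compile(r"(?<!\w)" + body + r"(?!\w)", re.IGNORECASE)


COMPILED = {
    name: [compile_pattern(p) for p in patterns]
    for name, patterns in PATTERNS.items()
}


def label_bio(bio):
    """Return the single class whose patterns match, or None."""
    matched = {
        name for name, regexes in COMPILED.items()
        if any(regex.search(bio) for regex in regexes)
    }
    return matched.pop() if len(matched) == 1 else None
```

For instance, `label_bio("Proud Democrats supporter")` yields `"democrat"`, while a biography mentioning both "democrat" and "GOP" yields `None`.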

Certainly, such a labeling method is error prone but the
goal was to obtain the largest possible labeled set for each
class and attribute. Because of the nature of McC-Splat, it
was assumed that large although noisy data was preferable
to cleaner but smaller samples. After all, should the results be
encouraging, better labeling approaches could be used.

Finally, the labeled sets were split into training and test
partitions: the former consisted of a random selection of
80% of the users in each class and was used as input for
the algorithm; the latter comprised the remaining 20% of the
users and was left out for evaluation.
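A per-class 80/20 split of this kind can be sketched as follows; the function name, the fixed seed, and the toy user identifiers are hypothetical choices made for the illustration.

```python
import random


def split_by_class(labeled, train_fraction=0.8, seed=0):
    """Split labeled users into train and test partitions, class by class.

    labeled: dict user -> class name.
    Returns two dicts (train, test) with the same structure.
    """
    rng = random.Random(seed)
    by_class = {}
    for user, name in labeled.items():
        by_class.setdefault(name, []).append(user)

    train, test = {}, {}
    for name, users in by_class.items():
        users = sorted(users)  # fixed order so that the seed is enough
        rng.shuffle(users)
        cut = round(len(users) * train_fraction)
        train.update({user: name for user in users[:cut]})
        test.update({user: name for user in users[cut:]})
    return train, test


# Hypothetical toy data: 10 democrats and 50 republicans.
labeled = {f"d{i}": "democrat" for i in range(10)}
labeled.update({f"r{i}": "republican" for i in range(50)})
train, test = split_by_class(labeled)
```

Splitting within each class keeps the class proportions of the training and test partitions equal to those of the whole labeled set.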

3.3 Results 

McC-Splat was applied to the Twitter graph in each of its
four different "flavors", considering only the users in the
training partitions as labeled. That way, labels were obtained
for the rest of the users in the graph, including those in the
test partitions. Then, by comparing the algorithm's class
assignments for those users with their actual class membership
according to their biographies, precision and recall figures
were computed (see table 4). For comparison purposes, the
performance of a random classifier based on the proportion of
each of the different classes is shown in table 2.
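The random baseline can be worked out in closed form. Assuming the classifier draws each label independently with probability equal to the class proportion p, the expected precision and recall of a class both equal p, the micro-average equals the sum of the squared proportions, and the macro-average equals one over the number of classes. The sketch below applies this to the sex counts of table 1; the function name is hypothetical.

```python
def random_baseline(counts):
    """Expected figures for a classifier that guesses by class proportion.

    counts: dict class name -> number of labeled users.
    Returns (per_class, micro_avg, macro_avg), where per_class holds the
    expected precision (= recall = F1) of each class.
    """
    total = sum(counts.values())
    per_class = {name: n / total for name, n in counts.items()}
    micro_avg = sum(p * p for p in per_class.values())
    macro_avg = sum(per_class.values()) / len(per_class)
    return per_class, micro_avg, macro_avg


per_class, micro_avg, macro_avg = random_baseline(
    {"female": 271539, "male": 384574}
)
```

For sex this gives roughly 0.4139 (female), 0.5861 (male), a micro-average of 0.5148, and a macro-average of 0.5, which appears consistent with the corresponding row of the random classifier table.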

In addition to that first experiment, a second one was
conducted on another independently labeled set. To that
end, data was collected from WeFollow (http://wefollow.com),
which is a Twitter user directory where users classify
themselves according to the topics they are interested in.
Each topic is represented by a tag, and a list of users
following each tag can be obtained.

Footnote: All of the chosen attributes, except for
race/ethnicity, appear in Facebook profiles and, thus, it would
not be surprising to find information on them in Twitter
biographies.

Footnote: Table 1 reveals that the number of labeled examples
obtained was rather low for all of the attributes and,
unsurprisingly, the more sensitive the attribute, the fewer users
disclose information about it in their biographies.

Table 1: Classes for each of the six personal attributes along with the rules applied to label Twitter users according
to them. All of the labels, except for sex, were obtained by pattern-matching the users' biographies. The age
intervals were those used by [5].

Attribute | Class | Rule or pattern | # users
--- | --- | --- | ---
sex | female | User name had to be composed of a first and last name from the US Census. Sex was assigned according to frequency of use of the first name in the U.S. population. | 271,539
sex | male | (same rule as above) | 384,574
age | teenage | Age was extracted from the user's bio looking for the patterns year-old or years old preceded by a number or a numeral. Then, ages <18, 18-24, 25-34, 35-49, and >49 were assigned to each class. | 3,483
age | youngster | (same rule as above) | 4,562
age | young | (same rule as above) | 1,911
age | mid-age | (same rule as above) | 663
age | elder | (same rule as above) | 296
political orientation | democrat | democrat*, lib-dem*, libdem*, dems* | 248
political orientation | republican | conservat*, gop, g.o.p., palin, pro-life*, prolife*, republican*, right-wing, rightish, tcot, tea-party, teaparty | 2,040
religious affiliation | atheist | agnost*, anti-theis*, antitheis*, ateus, atheis*, athiest*, empiricist*, godless*, heathen*, humanism*, humanist*, irreligion*, non-believer*, non-theist*, nonbeliever*, nontheist*, pagan*, rational*, sceptic*, secular*, skepchicks, skeptic* | 330
religious affiliation | buddhist | buddh*, dhamma*, dharma*, sangha, twangha, vipassana, yoga*, yoginis, yogis, zen | 204
religious affiliation | christian | adventist*, anglican*, baptist*, cathol*, cattolici, christ*, church*, evangelical, gospel*, jesus*, lutheran*, methodist*, minister*, ministries*, ministry*, pastor*, pentecostal*, preacher*, presbyterian*, priest* | 8,103
religious affiliation | jewish | circumcision, israel*, jerusalem, jew, jewish, jews, judaism, jude, kosher, rabbi, sephardic, synagogue*, torah, yiddish, zion* | 458
religious affiliation | muslim | imam, islam*, isulamic*, mosque*, muslim*, quran, salaam, tweeplims | 171
race/ethnicity | asian-american | asian*, chinese-american, filipin*, hindu*, india, indian-american, japan, japanese-american, korea*, taoism, vietnam* | 65
race/ethnicity | black | africa*, black, black-american, black-man, black-woman, hip-hop*, hiphop* | 202
race/ethnicity | hispanic | amigo, belleza, familia, favoritos, gente, hispanic, latina, latino, mexico | 6
race/ethnicity | native-american | aboriginal, alaska-native, american-indian, first-nation, firstnation, indigenous*, native american, native-american | 80
race/ethnicity | native-hawaiian | aloha, hawaii*, honolulu, native hawaiian, native-hawaiian, oahu, ohana | 4
race/ethnicity | white | caucasian, white, white-american, white-man, white-woman | 24
sexual orientation | heterosexual | hetero* | 15
sexual orientation | homosexual | bisexual*, gay*, glbt, glsen, gltb, homo-*, homosex*, l-word, lesbian*, lgbt, lgbtq, marriage-equality, queer, transgender | 1,471

Hence, most of the patterns from table 1 were employed
to obtain lists of users from WeFollow. Needless to say,
not every user in those lists appeared in the Twitter user
graph and, therefore, those users not appearing in the graph,
in addition to those already labeled -i.e. appearing in the
training and test partitions- were removed.

Performance results on this second dataset for both the
random classifier and the McC-Splat algorithm can be seen
in tables 3 and 5, respectively.

3.4 Discussion of Results 

As can be seen from tables 4 and 5, the performance of
McC-Splat was notably high. Average precision and accuracy
figures were quite similar, implying that performance
across classes within the same attribute is comparable and,
thus, there was not much bias towards the prevalent classes.

Attributes such as religious affiliation, political
orientation, sexual orientation, and race and ethnicity achieved
above 95% precision when evaluating on the test partitions.
Results in the WeFollow dataset were very similar, except for
race/ethnicity where precision dropped to 50% and accuracy
to 71%.

The poorest results were achieved when assigning sex and
age: 62% and 43% macro-averaged precision, respectively.
Age may have been problematic because it is actually a
continuous variable. After reviewing the actual
classifications it was found that most of the errors were
due to assigning users to nearby classes -e.g. classifying
teenagers as youngsters, youngsters as young, etc.

All in all, McC-Splat clearly outperformed the random
classifier by a large margin although, certainly, when an
attribute has a clearly prevalent class the random classifier
is much more difficult to outperform. In the presence of such
prevalent classes the random classifier achieved good accuracy
but poor macro-averaged precision; McC-Splat, instead, was
not much affected by such prevalent classes and exhibited
comparable precision across classes.
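The gap between accuracy and macro-averaged precision under a prevalent class can be illustrated with a short sketch. The function name and the toy data are hypothetical, and a class that is never predicted is assumed to have zero precision.

```python
def evaluate(truth, predicted):
    """Compute accuracy and macro-averaged precision.

    truth, predicted: dicts user -> class name.
    """
    classes = sorted(set(truth.values()))
    hits = sum(1 for user in truth if predicted.get(user) == truth[user])
    accuracy = hits / len(truth)

    precision = {}
    for name in classes:
        assigned = [u for u in truth if predicted.get(u) == name]
        correct = [u for u in assigned if truth[u] == name]
        # Assumption: a class that is never predicted gets precision 0.
        precision[name] = len(correct) / len(assigned) if assigned else 0.0
    macro_precision = sum(precision.values()) / len(classes)
    return accuracy, macro_precision


# Hypothetical test partition: 8 republicans and 2 democrats, with a
# classifier that always answers with the prevalent class.
truth = {f"r{i}": "republican" for i in range(8)}
truth.update({f"d{i}": "democrat" for i in range(2)})
predicted = {user: "republican" for user in truth}
accuracy, macro_precision = evaluate(truth, predicted)
```

Always predicting the prevalent class reaches 80% accuracy here, but the macro-averaged precision is only 40% because the minority class is never recovered. Similar accuracy and macro-averaged precision figures therefore suggest little bias towards prevalent classes.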

Regarding the different "flavors", the Plain-vanilla version
did not outperform the random classifier for prevalent classes
(e.g. male vs female, young vs the rest of age intervals,
and christians vs the rest of religious affiliations), and it
even underperformed when classifying homosexual individuals.
The rest of the "flavors" clearly outperformed the random
classifier -even for prevalent classes- and they consistently
achieved high performance figures. Therefore, Plain-vanilla
could be discarded, and additional experiments are
required to find which of the other three alternatives is
the best choice. In this regard, better labeled data -in
particular for large majority classes- is also needed.

4. RELATED WORK 

As has been said, McC-Splat is a semi-supervised graph
labeling algorithm based on label propagation. There are
other algorithms which are somewhat similar and, hence,
those most closely related are briefly reviewed.

Footnote: For instance, http://wefollow.com/twitter/democrat
gives access to a list of users self-defined as Democrat, while
http://wefollow.com/twitter/republican provides a list
of Republican users.

Footnote: Sex and age could not be tested with data from
WeFollow.

Maybe the best known iterative graph algorithm is PageRank
[15], which computes for each vertex -generally a web page-
a score which corresponds to its relevance within the network.
Its popularity has spurred the use of similar methods
in many other scenarios -e.g. to fight spam in the Web [4].

With regard to the use of the graph structure to perform
classification, one of the earliest works was a hypertext
classifier [1]. In this case, however, the links were used to
improve the classifier but other clues -such as the documents'
content- were also required.

Much more related to McC-Splat are the works described
in [11, 14]. An iterative application of Bayesian classifiers
is described in [14], where the objects' attributes were
modified from the inferences made on their neighbors in each
iteration. In [11] the so-called wvRN method is described.
That algorithm works on undirected weighted graphs and
relies just on the objects' labels and relationships. It
estimates the probability of an object belonging to a given class
as the weighted proportion of its neighbors that belong to
that class and, then, the majority label is assigned after each
iteration.

Although related, there are several differences between
wvRN and McC-Splat: the latter works on unweighted
directed graphs, and labels are not assigned by majority vote
but, instead, weight vectors are propagated. Besides, in the
absence of labeled neighbors wvRN assigns labels on the basis
of the class priors -i.e. a random classifier- while McC-Splat
assigns the unknown class. Finally, the use of a sink node
and the estimation of weight vectors for the labeled examples
to perform auto-thresholding are novel additions which
could be compared to cautious classification [12].

As has been said, data mining users' relationships in
OSNs raises some concerns and, in fact, this study has
exposed the privacy risks due to the public nature of those
relationships. Hence, this work has some points of
similarity with a number of recent studies on privacy in OSNs.

It has been shown, for instance, that different kinds of
attacks can be conducted on the basis of known relationships
and group memberships [18, 19], and a number of studies
have provided additional support for those findings in Face- 
book -e.g. PE]. 

It has been stated that privacy attacks can be successful
when "as much as half of the profiles are private" [18].
However, this study has revealed that the number of required
known users is, in fact, much lower -well below 1% for a
sample of 1.8 million users- and the achieved precision is
much higher than the one reported in [18]. Thereby, privacy
issues arising from publicizing acquaintances in OSNs should
be a major concern for their users.

Finally, a few pertinent works on measures to improve 
privacy in OSNs are referenced to provide context for the 
protocol described in the last section. 

At least two different Facebook applications relying on 
public key cryptography to store obfuscated information in 
the OSN servers have been proposed [51 [TD] . By doing that 
users can still make use of the OSN services but their
personal information is decrypted on the client side and, thus,
it is inaccessible for the OSN operator. Needless to say,
encrypted text is relatively easy to detect and, thus, a "hostile"
OSN operator could disable accounts using such a measure.

Table 2: Performance of a random classifier based on the proportion of each class in the labeled data and
working on the same labeled data.

Attribute | Average or class | P=R=F1
--- | --- | ---
sex | Micro-avg. | 0.5148
sex | Macro-avg. | 0.5
sex | female | 0.4139
sex | male | 0.5861
age | Micro-avg. | 0.3116
age | Macro-avg. | 0.2
age | teenage | 0.3191
age | youngster | 0.4180
age | young | 0.1751
age | mid-age | 0.0607
age | elder | 0.0271
religious affiliation | Micro-avg. | 0.7693
religious affiliation | Macro-avg. | 0.2
religious affiliation | atheist | 0.0356
religious affiliation | buddhist | 0.0220
religious affiliation | christian | 0.8745
religious affiliation | jewish | 0.0494
religious affiliation | muslim | 0.0185
political orientation | Micro-avg. | 0.8068
political orientation | Macro-avg. | 0.5
political orientation | democrat | 0.1084
political orientation | republican | 0.8916
sexual orientation | Micro-avg. | 0.9798
sexual orientation | Macro-avg. | 0.5
sexual orientation | heterosexual | 0.0101
sexual orientation | homosexual | 0.9899
race/ethnicity | Micro-avg. | 0.3586
race/ethnicity | Macro-avg. | 0.1667
race/ethnicity | asian-american | 0.1706
race/ethnicity | black | 0.5302
race/ethnicity | hispanic | 0.0157
race/ethnicity | native-american | 0.2100
race/ethnicity | native-hawaiian | 0.0105
race/ethnicity | white | 0.0630

Table 3: Performance of a random classifier based on the proportion of each class in the labeled data and
working on the WeFollow dataset.

Attribute | Average or class | P R F1
--- | --- | ---
religious affiliation | Micro-avg. | 0.6843
religious affiliation | Macro-avg. | 0.2
religious affiliation | atheist | 0.0872 0.0356 0.0506
religious affiliation | buddhist | 0.0513 0.0220 0.0308
religious affiliation | christian | 0.7741 0.8745 0.8213
religious affiliation | jewish | 0.0484 0.0494 0.0489
religious affiliation | muslim | 0.0389 0.0185 0.0250
political orientation | Micro-avg. | 0.7966
political orientation | Macro-avg. | 0.5
political orientation | democrat | 0.1213 0.1084 0.1145
political orientation | republican | 0.8787 0.8916 0.8851
sexual orientation | Micro-avg. | 0.9899
sexual orientation | Macro-avg. | 0.5 0.9950 0.6655
sexual orientation | heterosexual | 10
sexual orientation | homosexual | 1 0.9899 0.9949
race/ethnicity | Micro-avg. | 0.2119
race/ethnicity | Macro-avg. | 0.1667 0.4878 0.2484
race/ethnicity | asian-american | 0.6022 0.1706 0.2659
race/ethnicity | black | 0.1853 0.5302 0.2746
race/ethnicity | hispanic | 0.1735 0.0157 0.0289
race/ethnicity | native-american | 0.0391 0.2100 0.0659
race/ethnicity | native-hawaiian | 10
race/ethnicity | white | 10

Because of that, it has been proposed to use instead a
dictionary known to the members of a group [3]. Such a
dictionary would provide a way to replace "atoms" of personal
information with atoms from other users. For instance, the 
name, age, or sex of a user would be stored in such a way 
that they still resemble personal information but cannot be 
linked to the actual individual. By using the dictionary, the 
group members could translate that fake information into 
the actual attributes of their acquaintance. Purportedly, this 
measure is much more difficult to detect than cryptography 
and, thereby, it could be applied even when using the ser- 
vices provided by "hostile" OSN operators [3]. 

5. IMPLICATIONS AND CONCLUSIONS 

5.1 Implications for Users' Privacy

Users' sensitive information, such as political or religious
beliefs, race and ethnicity, or sexual orientation, can be
determined with notable precision from their neighbors with
rather simple algorithms. Thereby, it does not matter whether
users self-disclose personal traits; they can be
inadvertently exposed because of acquaintances who do not
conceal such information.

Most works on privacy in OSNs have mainly focused on 
ways to guarantee that released datasets do not put at risk 
the users' privacy -e.g. [17]. Certainly, such anonymization 
measures may dispel some concerns the operators of OSNs 
can have about releasing data for research purposes. However,
it is not at all necessary to obtain the data from the
operator of the OSN, since it is relatively easy to collect it
using the available APIs. Therefore, in spite of anonymization
methods, users of OSNs are fully exposed to any third party
aiming to data mine social graphs.

A trivial solution for that problem would be, of course, to
disable the APIs. This, however, is unlikely to happen
because it would be contrary to the interests of the operators
of the OSNs. In addition to that, it would just make it
difficult for third parties to mine the users' data but would not
prevent the operator of the OSN and licensed third parties
from doing it.

A number of works, some of them referenced in the
previous section, propose that users encrypt the information they
submit to the system. Needless to say, making the users'
information opaque to the OSN would put at risk their current
business models which, to a greater or lesser extent, revolve
around marketing and personalization. Thereby, it does not
seem unreasonable to assume that if encryption went
mainstream among OSN users, the operators of the services would
force users to use plain text.

5.2 Minimizing Data Mining Risks 

So, to sum up, graph anonymization is an unreliable
passive measure, and heavy use of cryptography, an active
user's measure, could be easily disallowed by the operators
of the OSNs. Hence, procedures to minimize privacy
leakages should be active and keep the use of cryptography to
a minimum; some hints on such a prophylactic protocol are
provided here.

Footnote: Several ways in which an attacker can obtain
information on network relationships by compromising a number
of user accounts are described in [7].

Footnote: Passive, that is, from the point of view of the users.

First of all, the following protocol has been devised for
asymmetrical social networks in general, and Twitter in
particular. Secondly, users are responsible for the
information they disclose about themselves; that is, the
purpose of this protocol is not to protect their privacy
regardless of their actions, but to minimize the likelihood
of being exposed because of their relationships. Thirdly,
users cannot control who follows them, only whom they
follow. It has been shown that these relationships are risky
and, thus, identifiable accounts should not be used to
follow anybody.

Needless to say, the network is useless if users are
isolated and, therefore, they need a mechanism to follow
other users. To that end, a second account is to be used.
Its nickname should be a totally random string, and no
information should be provided other than a public key. This
anonymous account (in contrast to the previous, identified
account) would not be used to post messages other than
mentions of followees, and it would not accept followers.
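
As an illustration of the anonymous account described above, the following is a minimal sketch rather than a definitive implementation. The names `random_nickname`, `AnonymousProfile` and `new_anonymous_profile`, the alphabet, the nickname length of 15 characters and the representation of the public key as an opaque string are all assumptions made for the example; none of them is prescribed by the protocol.

```python
import secrets
import string
from dataclasses import dataclass

# Assumed alphabet and length; the text only asks for "a totally random string".
ALPHABET = string.ascii_lowercase + string.digits


def random_nickname(length: int = 15) -> str:
    """Return a nickname drawn uniformly at random from ALPHABET."""
    # secrets (not random) is used so that the nickname is unpredictable.
    return "".join(secrets.choice(ALPHABET) for _ in range(length))


@dataclass(frozen=True)
class AnonymousProfile:
    """Hypothetical record holding all that the anonymous account discloses."""
    nickname: str                    # random string, unrelated to the owner
    public_key: str                  # the only information provided
    accepts_followers: bool = False  # the anonymous account rejects followers


def new_anonymous_profile(public_key: str) -> AnonymousProfile:
    """Build the profile of an anonymous account around a given public key."""
    return AnonymousProfile(nickname=random_nickname(), public_key=public_key)
```

A client could call `new_anonymous_profile(key)` once per user and keep the resulting nickname separate from anything that identifies the owner.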

Obviously, using two different accounts would be pointless
if they could be linked to each other by means of the IP
address. Therefore, the anonymous account should connect to
the service through an anonymizing service such as Tor or
K-FQ, while this is not necessary for the identified account.

With regard to message publishing, messages that mention no
account, or that mention an identified account, could be
published unencrypted. After all, users are responsible for
what they publish about themselves, and cannot control the
messages other users address to them. However, if the
message is a reply from an identified account to an
anonymous account, it should be fully encrypted using the
public key corresponding to the anonymous account. The
reason for this is to prevent eavesdroppers from discovering
implicit links starting from identified accounts.
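
The publishing rule above can be sketched as a small decision function. This is only an illustration under stated assumptions: the `Kind` enumeration and the `must_encrypt` helper are hypothetical names, and the case of an anonymous account addressing another anonymous account, which the text does not cover, is arbitrarily left unencrypted here.

```python
from enum import Enum
from typing import Iterable


class Kind(Enum):
    """The two kinds of account used by the protocol."""
    IDENTIFIED = "identified"
    ANONYMOUS = "anonymous"


def must_encrypt(sender: Kind, mentioned: Iterable[Kind]) -> bool:
    """Return True when a message should be encrypted before publishing.

    Only the case spelled out in the text triggers encryption: a message
    from an identified account that addresses an anonymous account.
    Messages with no mentions, or mentioning only identified accounts,
    are published in the clear.
    """
    addresses_anonymous = any(kind is Kind.ANONYMOUS for kind in mentioned)
    return sender is Kind.IDENTIFIED and addresses_anonymous
```

When the function returns True, the client would encrypt the whole message with the public key published by the anonymous account being addressed.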

The most cumbersome part would be the exchange of
credentials between anonymous and identified accounts. Such
an exchange would be needed to allow users to follow their
followers. As has been said, anonymous accounts are not for
publishing messages and, thus, following them would be of no
interest. However, after receiving a new follower, that
anonymous account is the only piece of information the user
has to reach the follower's identified account. Hence, the
user receiving a new follower should publish his or her
public key encrypted with the public key of the new
follower. The follower would publish, in return, the
nickname of his or her identified account encrypted with the
public key of the followee. At that point, the followee
could use his or her anonymous account to start following
the identified account of the new follower.
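
The message flow of this exchange can be simulated as follows. The sketch deliberately uses a placeholder in place of public-key encryption: `seal` and `unseal` perform no cryptography at all and only model who is able to read what. All names (`KeyPair`, `Sealed`, `seal`, `unseal`, `exchange`) are hypothetical, and it is an assumption of this example that the key published by the followee is the one tied to his or her anonymous account.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KeyPair:
    """Stand-in for a public/private key pair (opaque strings)."""
    public: str
    private: str


@dataclass(frozen=True)
class Sealed:
    """Placeholder for a ciphertext. NOT real cryptography."""
    recipient_public: str
    payload: str


def seal(recipient_public: str, payload: str) -> Sealed:
    """Model 'encrypt payload with the recipient's public key'."""
    return Sealed(recipient_public, payload)


def unseal(keys: KeyPair, box: Sealed) -> str:
    """Model decryption: only the holder of the matching key pair succeeds."""
    if box.recipient_public != keys.public:
        raise PermissionError("message was sealed for a different key")
    return box.payload


def exchange(followee: KeyPair, follower: KeyPair, follower_nickname: str) -> str:
    """Run the three steps of the exchange; return what the followee learns."""
    # Step 1: the followee publishes its public key, sealed for the follower.
    first = seal(follower.public, followee.public)
    # Step 2: the follower opens it and replies with the nickname of its
    # identified account, sealed for the followee.
    followee_public = unseal(follower, first)
    second = seal(followee_public, follower_nickname)
    # Step 3: the followee opens the reply and can now follow that account
    # from its own anonymous account.
    return unseal(followee, second)
```

A real client would replace `seal` and `unseal` with an actual public-key scheme; the control flow is the only part this sketch is meant to convey.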

Clearly, that chain of actions would allow an eavesdropper
to link anonymous and identified accounts. Thus, to avoid
it, the exchange of credentials could be made at
pre-scheduled hours. In addition, it must be clear that this
protocol does not aim to keep users anonymous to each other
but to conceal their relationships from third parties
observing the social network, including its operators.
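
One way to read "pre-scheduled hours" is that a client never publishes a credential message immediately, but defers it to the next slot of a fixed timetable, so that the time of publication says nothing about when the follow event happened. The sketch below illustrates that reading only; the timetable `SLOT_HOURS` and the helper `next_slot` are assumptions, and time zones are ignored.

```python
from datetime import datetime, timedelta

# Hypothetical timetable: credential messages go out at these hours only.
SLOT_HOURS = (0, 6, 12, 18)


def next_slot(now: datetime, slot_hours=SLOT_HOURS) -> datetime:
    """Return the first scheduled instant strictly after `now`."""
    for day_offset in (0, 1):
        # Midnight-aligned copy of today (offset 0) or tomorrow (offset 1).
        day = (now + timedelta(days=day_offset)).replace(
            minute=0, second=0, microsecond=0
        )
        for hour in sorted(slot_hours):
            candidate = day.replace(hour=hour)
            if candidate > now:
                return candidate
    raise ValueError("slot_hours must not be empty")
```

For example, a follower received at 13:30 would trigger a credential message at 18:00 on the same day, together with every other exchange queued for that slot.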



https://www.torproject.org/
Finally, all of these measures should be implemented by
client software in such a way that the user can use the OSN
transparently.

5.3 Final Remarks and Future Work 

A new algorithm to perform user profiling in social
networks, McC-Splat, has been described. The new method is
related to other known algorithms but, unlike them, it does
not require ad hoc thresholds; instead, it provides a number
of alternatives to perform automatic thresholding from the
input labeled data.

A number of experiments were conducted to test its
performance. Results from those experiments have been
reported, revealing that McC-Splat largely outperforms a
random classifier and, in fact, achieves a notably high
precision for very different classes and attributes.
Nevertheless, further experiments are needed to determine
which of the different "flavors" of the algorithm is the
best choice, and to test the algorithm on data from OSNs
other than Twitter, and labeled by different means.

The attributes employed in the experiments are usually
considered sensitive personal information and, thus, the
experiments had an additional outcome: exposing the risk
that acquaintances pose to users, who can be exposed even
without revealing any personal information about themselves.

Therefore, a prophylactic protocol to minimize leakages due
to graph data mining was outlined. Further work is needed in
this regard: a prototype implementation is needed, in
addition to field studies regarding its use by real users
and an analysis of its sensitivity to different kinds of
attacks, mainly those based on infiltration.
Table 4: Performance figures for the six attributes and the four different "flavors" of the McC-Splat algorithm working on the Twitter dataset.
Details for each individual class are provided in addition to aggregated figures, both micro- and macro-averaged. Micro-averaged precision
is equivalent to the accuracy of the classifier for each attribute. Figures in bold correspond to "material" performance improvements against
the random classifier, i.e. larger than 10%, according to the criterion proposed by [16].

Attribute              Class           | Plain-vanilla        | Sink-absolute        | Sink-relative        | Percentile
                                       | P      R      F1     | P      R      F1     | P      R      F1     | P      R      F1
sex                    female          | 0.6307 0.1091 0.186  | 0.6149 0.3135 0.4153 | 0.6125 0.3356 0.4337 | 0.6174 0.2432 0.3489
                       male            | 0.6039 0.8852 0.718  | 0.6803 0.4792 0.5623 | 0.6907 0.4679 0.5579 | 0.6765 0.3318 0.4452
                       Macro-avg.      | 0.6173 0.4972 0.5507 | 0.6476 0.3964 0.4917 | 0.6516 0.4018 0.497  | 0.6469 0.2875 0.3981
                       Micro-avg.      | 0.606  0.564  0.5842 | 0.6582 0.4106 0.5057 | 0.6623 0.4132 0.5089 | 0.6551 0.2951 0.4069
age                    teenage         | 0.533  0.1392 0.2207 | 0.5112 0.1636 0.2478 | 0.5398 0.175  0.2644 | 0.5989 0.1564 0.248
                       youngster       | 0.4438 0.8697 0.5877 | 0.5    0.2267 0.312  | 0.5375 0.1961 0.2873 | 0.5464 0.1742 0.2641
                       young           | 0.3607 0.0574 0.0991 | 0.2825 0.1305 0.1786 | 0.2458 0.1149 0.1566 | 0.2411 0.0705 0.1091
                       mid-age         | 0.3    0.0226 0.042  | 0.1441 0.1278 0.1355 | 0.135  0.1654 0.1486 | 0.1897 0.0827 0.1152
                       elder           | 0.5    0.0167 0.0323 | 0.0857 0.1    0.0923 | 0.0792 0.1333 0.0994 | 0.0476 0.0167 0.0247
                       Macro-avg.      | 0.4275 0.2211 0.2915 | 0.3047 0.1497 0.2008 | 0.3075 0.1569 0.2078 | 0.3247 0.1001 0.153
                       Micro-avg.      | 0.4486 0.4195 0.4336 | 0.3932 0.1802 0.2472 | 0.3743 0.1715 0.2353 | 0.4623 0.1404 0.2154
religious affiliation  atheist         | 1      0.2576 0.4096 | 0.4719 0.6364 0.5419 | 0.4699 0.5909 0.5235 | 0.6579 0.3788 0.4808
                       buddhist        | 1      0.2195 0.36   | 0.6667 0.6341 0.65   | 0.4143 0.7073 0.5225 | 0.5769 0.3659 0.4478
                       christian       | 0.9174 0.9186 0.918  | 0.9899 0.7896 0.8785 | 0.9926 0.7409 0.8485 | 0.9928 0.768  0.8661
                       jewish          | 0.9592 0.5109 0.6667 | 0.6495 0.6848 0.6667 | 0.617  0.6304 0.6237 | 0.8154 0.5761 0.6752
                       muslim          | 1      0.4571 0.6275 | 0.8    0.6857 0.7385 | 0.2747 0.7143 0.3968 | 0.8571 0.5143 0.6429
                       Macro-avg.      | 0.9753 0.4727 0.6368 | 0.7156 0.6861 0.7005 | 0.5537 0.6768 0.6091 | 0.78   0.5206 0.6245
                       Micro-avg.      | 0.9207 0.8507 0.8843 | 0.927  0.7736 0.8434 | 0.8734 0.7288 0.7946 | 0.9658 0.731  0.8322
political orientation  democrat        | 1      0.26   0.4127 | 0.85   0.34   0.4857 | 0.6905 0.58   0.6304 | 0.7045 0.62   0.6596
                       republican      | 0.9157 0.9583 0.9365 | 0.944  0.9093 0.9263 | 0.973  0.8848 0.9268 | 0.9808 0.875  0.9249
                       Macro-avg.      | 0.9579 0.6092 0.7447 | 0.897  0.6247 0.7365 | 0.8318 0.7324 0.7789 | 0.8427 0.7475 0.7922
                       Micro-avg.      | 0.9182 0.8821 0.8998 | 0.9395 0.8472 0.8909 | 0.9443 0.8515 0.8955 | 0.951  0.8472 0.8961
sexual orientation     heterosexual    | 1      0      0      | 1      0      0      | 1      0      0      | 1      0      0
                       homosexual      | 0.9892 0.9418 0.9649 | 1      0.8116 0.896  | 1      0.8116 0.896  | 1      0.8116 0.896
                       Macro-avg.      | 0.9946 0.4709 0.6392 | 1      0.4058 0.5773 | 1      0.4058 0.5773 | 1      0.4058 0.5773
                       Micro-avg.      | 0.9892 0.9322 0.9599 | 1      0.8034 0.891  | 1      0.8034 0.891  | 1      0.8034 0.891
race/ethnicity         asian-american  | 0.8571 0.4615 0.6    | 0.8571 0.4615 0.6    | 0.75   0.4615 0.5714 | 0.8571 0.4615 0.6
                       black           | 0.9412 0.7805 0.8533 | 0.9412 0.7805 0.8533 | 0.9412 0.7805 0.8533 | 0.9412 0.7805 0.8533
                       hispanic        | 1      0      0      | 1      0      0      | 1      0      0      | 1      0      0
                       native-american | 0.9231 0.75   0.8276 | 0.9231 0.75   0.8276 | 0.9167 0.6875 0.7857 | 0.9231 0.75   0.8276
                       native-hawaiian | 1      0      0      | 1      0      0      | 1      0      0      | 1      0      0
                       white           | 1      0      0      | 1      0      0      | 1      0      0      | 1      0      0
                       Macro-avg.      | 0.9536 0.332  0.4925 | 0.9536 0.332  0.4925 | 0.9347 0.3216 0.4785 | 0.9536 0.332  0.4925
                       Micro-avg.      | 0.9259 0.641  0.7576 | 0.9259 0.641  0.7576 | 0.9074 0.6282 0.7424 | 0.9259 0.641  0.7576


Table 5: Performance figures of the four "flavors" of the McC-Splat algorithm working on the WeFollow dataset. Bold figures correspond to
performance differences above 10% when comparing against the random classifier (see Table 3).

Attribute              Class           | Plain-vanilla        | Sink-absolute        | Sink-relative        | Percentile
                                       | P      R      F1     | P      R      F1     | P      R      F1     | P      R      F1
religious affiliation  atheist         | 0.9795 0.1496 0.2595 | 0.8615 0.3117 0.4577 | 0.8355 0.3062 0.4481 | 0.8735 0.2326 0.3673
                       buddhist        | 0.95   0.0759 0.1406 | 0.7349 0.1625 0.2661 | 0.5447 0.1864 0.2778 | 0.7034 0.1358 0.2277
                       christian       | 0.8705 0.3084 0.4555 | 0.9878 0.2643 0.417  | 0.9982 0.2411 0.3897 | 0.9983 0.2518 0.4021
                       jewish          | [illegible]          | [illegible]          | [illegible]          | [illegible]
                       muslim          | 0.973  0.0633 0.1188 | 0.6667 0.0984 0.1715 | 0.2331 0.109  0.1485 | 0.6849 0.0879 0.1558
                       Macro-avg.      | [illegible]          | [illegible]          | [illegible]          | [illegible]
                       Micro-avg.      | 0.8799 0.2662 0.4088 | 0.9361 0.2543 0.4    | 0.8768 0.2382 0.3746 | 0.9566 0.2351 0.3774
political orientation  democrat        | 1      0.0478 0.0913 | 0.9429 0.0686 0.1279 | 0.7899 0.1954 0.3133 | 0.7638 0.2017 0.3191
                       republican      | 0.9178 0.3717 0.5291 | 0.9338 0.3602 0.5199 | 0.9778 0.3536 0.5194 | 0.9831 0.3502 0.5164
                       Macro-avg.      | 0.9589 0.2098 0.3442 | 0.9384 0.2144 0.349  | 0.8839 0.2745 0.4189 | 0.8735 0.276  0.4194
                       Micro-avg.      | 0.9191 0.3324 0.4882 | 0.934  0.3248 0.482  | 0.9616 0.3344 0.4963 | 0.9627 0.3322 0.4939
sexual orientation     heterosexual    | 1      1      1      | 1      1      1      | 0      1      0      | 0      1      0
                       homosexual      | 1      0.2685 0.4234 | 1      0.2415 0.389  | 1      0.2402 0.3874 | 1      0.2402 0.3874
                       Macro-avg.      | 1      0.6343 0.7762 | 1      0.6208 0.766  | 0.5    0.6201 0.5536 | 0.5    0.6201 0.5536
                       Micro-avg.      | 1      0.2685 0.4234 | 1      0.2415 0.389  | 0.9947 0.2402 0.387  | 0.9965 0.2402 0.3871
race/ethnicity         asian-american  | 0.8571 0.035  0.0672 | 0.8571 0.035  0.0672 | 0.8571 0.035  0.0672 | 0.8913 0.0341 0.0658
                       black           | 0.6818 0.1624 0.2623 | 0.6818 0.1624 0.2623 | 0.6919 0.161  0.2613 | 0.6722 0.1637 0.2633
                       hispanic        | [illegible]          | [illegible]          | [illegible]          | [illegible]
                       native-american | 0.4828 0.0897 0.1514 | 0.4828 0.0897 0.1514 | 0.4375 0.0897 0.1489 | 0.4483 0.0833 0.1405
                       native-hawaiian | [illegible]          | [illegible]          | [illegible]          | [illegible]
                       white           | [illegible]          | [illegible]          | [illegible]          | [illegible]
                       Macro-avg.      | 0.5036 0.3812 0.4339 | 0.5036 0.3812 0.4339 | 0.4978 0.381  0.4316 | 0.502  0.3802 0.4327
                       Micro-avg.      | 0.7078 0.0547 0.1015 | 0.7078 0.0547 0.1015 | 0.7045 0.0544 0.101  | 0.7036 0.0541 0.1006



